ABSTRACT

Research into, and the design and construction of, mobile
systems and algorithms require access to large-scale
mobility data. Unfortunately, the wireless and mobile
research community lacks such data. For instance, the 
largest available human contact traces contain only 100 
nodes with very sparse connectivity, limited by exper- 
imental logistics. In this paper we pose a challenge to 
the community: how can we collect mobility data from 
billions of human participants? We re-assert the impor- 
tance of large-scale datasets in communication network 
design, and claim that this could impact fundamental 
studies in other academic disciplines. In effect, we ar- 
gue that planet-scale mobility measurements can help 
to save the world. For example, by understanding
large-scale human mobility, we can track, model,
and contain the spread of epidemics of various kinds.

1. INTRODUCTION 

Human mobility traces are critically important to many
disciplines in addition to computer networking, ranging
from epidemiology JT] to urban planning [28].
Unfortunately, existing traces of human mobility are flawed:
using traditional social science methods to collect data
has proven difficult [30], and traces collected using
technological methods have suffered from a variety of
limitations. These include small size (the largest is 100
nodes [15]), short duration (the longest is 9 months [10]),
and high locality (many of the scenarios are limited
to campus and conference environments [6]). These
datasets may not be enough for large mobile system
evaluations, and are definitely insufficient for
epidemiology, where planet-wide measurements are needed to
track the spread of disease.

As members of the networking community, we have 
both the tools and methods (e.g., hardware and soft- 
ware knowledge) to conduct large-scale data collection. 
Furthermore, our contributions will not only benefit the 
wireless and mobile networking research communities, 
but will impact fundamental research in other areas,
allowing more features of human behaviour to be
uncovered. We believe that the situation is analogous to
that of complex networks research, which has flour- 
ished since 1989 when the first large datasets from the 
Internet (and subsequently the World Wide Web) be- 
came available [3]. To achieve similar improvements
in mobile networking and other related fields, relevant 
large-scale datasets must be made available. 

In this paper we challenge the community to collect 
large-scale human mobility traces. We highlight some 
of the issues in the hope that the community can help 
find good solutions. In the meantime, we propose some
solutions intended to form the basis of initial efforts;
our main aim is to raise these issues, to gain community
support for meeting this challenge, and to make the topic
a priority in the networking community.

2. WHY ARE LARGE-SCALE HUMAN MOBILITY TRACES IMPORTANT?

As mentioned above, large-scale datasets are useful 
for many aspects of research. In this paper we focus 
only on two aspects: system design and validation, and 
epidemiological studies. 

2.1 System design and validation 

After its first use in the evaluation of Dynamic Source
Routing [18], the random waypoint model became the
de facto standard mobility model in the mobile
networking community. For example, of the ten papers
in ACM MobiHoc 2002 that considered node mobility,
nine used the random waypoint model [31]. This
trend has changed dramatically in recent years with
the introduction of real mobility traces for evaluation:
of the ten papers considering node mobility in MobiHoc
2008, seven used real mobility traces.
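
For readers unfamiliar with the random waypoint model, the
following is a minimal sketch of it for a single node in a
rectangular area. The function name, parameters, and default
values are our own illustrative assumptions and are not taken
from any of the evaluations cited above.

```python
import math
import random


def random_waypoint(steps, width=1000.0, height=1000.0,
                    v_min=1.0, v_max=5.0, pause_steps=0, dt=1.0, seed=None):
    """Return a list of (x, y) positions, one per time step of length dt."""
    rng = random.Random(seed)
    x, y = rng.uniform(0, width), rng.uniform(0, height)
    trace = [(x, y)]
    while len(trace) < steps:
        # Pick the next waypoint and a travel speed uniformly at random.
        dest_x, dest_y = rng.uniform(0, width), rng.uniform(0, height)
        speed = rng.uniform(v_min, v_max)
        # Move in a straight line until the waypoint is reached.
        while len(trace) < steps:
            dist = math.hypot(dest_x - x, dest_y - y)
            step = speed * dt
            if dist <= step:
                x, y = dest_x, dest_y
                trace.append((x, y))
                break
            x += (dest_x - x) / dist * step
            y += (dest_y - y) / dist * step
            trace.append((x, y))
        # Optionally pause at the waypoint before choosing the next one.
        for _ in range(pause_steps):
            if len(trace) >= steps:
                break
            trace.append((x, y))
    return trace


trace = random_waypoint(steps=200, seed=1)
```

Nothing in this procedure depends on where people actually
go, which is why the model is so easy to use and also why it
says so little about real behaviour.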

The community has realised that unrealistic models 
are harmful for scientific research. Although real traces 
may suffer from limited numbers of participants, coarse 
granularity, and short experimental duration, they at 
least reflect some aspects of real life. Thanks to the
popularity of Online Social Networks (OSNs), we can
now gather large-scale data about the topology and
membership information of millions of OSN users and use
these to study aspects of the social networks [22, 20].
But where is the large-scale dataset for evaluating, for
instance, inter-city ad hoc communication using mobile
computing? Or even a single city-wide mobile
communication system (e.g., a delay-tolerant network,
or city-wide gaming)? We have very few empirical
hints for this. Without the help of real data, we
cannot even know whether this kind of system is possible.
Even if we extrapolate large-scale mobility traces from
small-scale traces, the problem of validating the
extrapolation remains.

Instead of using mobility traces directly to run
trace-driven simulations, a possible approach is to extract
characteristics from the data and build more realistic
mobility models. Much work has been done in
modeling human mobility for mobile ad hoc network
simulation [5]. Researchers have proposed more realistic
models by incorporating obstacles [17], social
information [23], and clustering features observed in mobility
datasets [25]. Analysis of real traces has demonstrated
power-law inter-contact time distributions with a cut-off
[6, 19], Lévy-flight patterns consisting of many small
moves followed by long jumps [26], heterogeneous
centralities [11] (i.e., popularity), and clustering structure [15].
But again, these results are from small-scale datasets
and are limited to specific scenarios with limited time
durations. Some researchers have extrapolated from
these by assuming, for instance, that the way people
move in a city is correlated with the centrality distribution
of the city graph [28], but this has yet to be verified
empirically. Gonzalez et al. [13] extracted coarse-grained
Lévy-walk properties from large-scale mobile phone
usage data. The limitation is that the dataset comes from
cellular base stations, which log only when mobile users
make a call or send a text message. This is very coarse in
both geographical and temporal granularity. People may
argue that human behavior should be scale-free in
different dimensions, but we need data for further
verification. Moreover, since the data from this study have
not been released, it is impossible to verify or build on
their findings.
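
To make the notion of an inter-contact time distribution
concrete, the sketch below derives inter-contact times from
a contact log and computes their empirical complementary
CDF, which is the curve usually inspected for a power law
with a cut-off. The log format (node pair plus start and end
times) and the function names are assumptions made for
illustration; they do not describe the format of any
particular published trace.

```python
from collections import defaultdict


def inter_contact_times(contacts):
    """contacts: iterable of (node_a, node_b, start, end) tuples.

    Returns the gaps between successive contacts of the same pair.
    """
    by_pair = defaultdict(list)
    for a, b, start, end in contacts:
        # A contact between a and b is the same as one between b and a.
        by_pair[frozenset((a, b))].append((start, end))
    gaps = []
    for intervals in by_pair.values():
        last_end = None
        for start, end in sorted(intervals):
            if last_end is not None and start > last_end:
                gaps.append(start - last_end)
            last_end = end if last_end is None else max(last_end, end)
    return gaps


def ccdf(samples):
    """Empirical P[X > x], evaluated at each distinct sample value."""
    ordered = sorted(samples)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered):
        if i + 1 == n or ordered[i + 1] != x:
            points.append((x, (n - i - 1) / n))
    return points


log = [("a", "b", 0, 10), ("b", "a", 30, 40), ("a", "c", 5, 6), ("a", "b", 100, 110)]
gaps = inter_contact_times(log)
curve = ccdf(gaps)
```

Plotting the curve on log-log axes is the usual first check
for a heavy tail, but with only a few hundred nodes the tail
is sparsely sampled, which is one reason larger datasets are
needed.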

We need large-scale human mobility datasets with 
better space and time granularity to verify the proper- 
ties we mention above. Following analogous progress 
in related fields, it seems likely that we will uncover 
many more features from such data which will help us 
to build good models. We believe that this is crucially
important for the mobile computing community. 

2.2 Epidemiological studies 

Figure 1: Metapopulation model composed of a network of
subpopulations connected by mobility.

Moving beyond social science, the communication
network community has also aided research in many
other academic disciplines. For instance, our
methodology and data make the modeling of human
dynamics possible [29], and more significantly, our data made
possible the development of the field of complex
network research [3]. Large-scale mobile data can
further enable the study of epidemic disease spreading.
The current state-of-the-art in epidemic modeling uses 
data from the International Air Transport Association 
(IATA) commercial airline traffic database to determine
travel between airports and to provide coarse-grained 
estimates of global spreading patterns Q, as well as 
data of transportation and commuting patterns in urban 
areas, which can be used to model a metapopulation 
mechanism of spreading [8]. Researchers cannot
develop more microscopic models of epidemic spreading
because of the lack of large-scale fine-grained empiri- 
cal data. 

To take a topical example, consider the current swine
flu outbreak. Scientists have urged governments to map
the spread of swine flu more accurately in order to
predict the number of people who may die from it [12].
Current predictions indicate that one in 200 people who
get swine flu badly enough to need medical help could
go on to die, but given that vaccines may not be ready
until later than hoped, accurate predictions are crucial.
Any estimates about swine flu are subject
to a wide margin of error, not least because not
everyone who catches it develops symptoms. More accurate
mapping of the spread of the virus must be carried out
if it is to be effectively managed. Monitoring doctors
and hospitals is insufficient since not everyone who is
infected with swine flu will become ill enough to report
their case to a doctor.

Figure 1 shows the process by which epidemics spread
through human mobility from one subpopulation (e.g., a
city) to another. When a susceptible individual (S) is in
contact with an infectious individual (either symptomatic
or asymptomatic), he or she becomes infected at a certain
rate and enters the latent class. When the latent period
ends, the individual becomes infectious (i.e., able to
transmit the infection). After the infectious period, all
infectious individuals enter the recovered class. If an
infectious individual moves to another city, the
subpopulation in the new city can also become infected.
Using the IATA data, scientists can roughly model the
migration of population across countries. But we need data
of much finer granularity, instead of assuming homogeneous
mixing in each subpopulation (city).

Figure 2: Confirmed number of swine flu cases
worldwide on May 19 (from GLEaMviz.org)
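
The compartmental process described above can be sketched
as a discrete-time metapopulation model. This is a
simplified illustration under our own assumptions: a single
infectious class (symptomatic and asymptomatic cases are not
distinguished), homogeneous mixing inside each city, and
made-up rate parameters beta, sigma, and gamma. It is not
the model used in the cited epidemiological studies.

```python
def slir_step(cities, travel, beta=0.5, sigma=0.25, gamma=0.2):
    """Advance every subpopulation by one time step (e.g., one day).

    cities: list of dicts with float counts under keys "S", "L", "I", "R".
    travel: travel[i][j] is the fraction of city i that moves to city j.
    """
    # Local dynamics: S -> L on infection, L -> I, then I -> R.
    updated = []
    for c in cities:
        n = c["S"] + c["L"] + c["I"] + c["R"]
        new_latent = beta * c["S"] * c["I"] / n if n > 0 else 0.0
        new_infectious = sigma * c["L"]
        new_recovered = gamma * c["I"]
        updated.append({
            "S": c["S"] - new_latent,
            "L": c["L"] + new_latent - new_infectious,
            "I": c["I"] + new_infectious - new_recovered,
            "R": c["R"] + new_recovered,
        })
    # Mobility: move the same fraction of every compartment between cities.
    moved = [dict(c) for c in updated]
    for i, row in enumerate(travel):
        for j, fraction in enumerate(row):
            if i == j:
                continue
            for k in "SLIR":
                flow = fraction * updated[i][k]
                moved[i][k] -= flow
                moved[j][k] += flow
    return moved


cities = [{"S": 1000.0, "L": 0.0, "I": 10.0, "R": 0.0},
          {"S": 1000.0, "L": 0.0, "I": 0.0, "R": 0.0}]
travel = [[0.0, 0.01], [0.01, 0.0]]
for _ in range(60):
    cities = slir_step(cities, travel)
```

The single term beta * S * I / n is where the homogeneous
mixing assumption enters; fine-grained contact data would
allow it to be replaced by measured contact patterns.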

Figure 2 shows the confirmed number of swine flu
cases worldwide on May 19, 2009. The outbreak started in
Mexico and then spread to other countries through human
mobility. We can see from the figure that the
worst-affected countries besides Mexico are nearby countries
such as the USA and Canada. Spain was the worst affected in
Europe because there are in general many connections
between Spain and Mexico, but we need better data to
build models to predict such behaviour.

Mobile computing can help to fight epidemics in at 
least two ways: 

Case 1: If we can track human health status in real time
or nearly real time, we can provide advice and
precautions for each user, accurately estimate the
number of asymptomatic infectious individuals, predict the
spreading process, identify the hotspots of the pandemic,
and effectively isolate infectious individuals. This may
be possible using personalised epidemic-monitoring software.
Users can self-identify their health status (e.g., cough,
cold) and embed this status in a Bluetooth service.
Users periodically run Bluetooth service discovery and
log the devices discovered, the health status of each
encountered user, and if possible also their geographical
locations. Users can upload their log files to a server,
which analyses the results and provides feedback.

Case 2: If we do not have the health status of each
user but only the contact log and the geographical
location of certain encounters, we can understand the
mixing properties of each subpopulation, model contact
and mobility processes, and identify the social hotspots.
With this understanding, we can more accurately predict and
emulate the spreading of diseases.
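
As an illustration of what the uploaded logs in the two
cases might contain, the sketch below defines a hypothetical
encounter record and two simple server-side analyses. The
record fields, the status strings, and the grid-cell size
are all assumptions made for this example; the Bluetooth
discovery step itself is platform-specific and is not shown.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Encounter:
    observer: str                      # device that ran the discovery
    peer: str                          # device that was discovered
    timestamp: float                   # seconds since some epoch
    peer_status: Optional[str] = None  # self-reported; None in Case 2
    location: Optional[Tuple[float, float]] = None  # (lat, lon) if known


def hotspots(log, cell=0.01, top=3):
    """Count located encounters per grid cell and return the busiest cells."""
    counts = Counter()
    for e in log:
        if e.location is None:
            continue
        lat, lon = e.location
        counts[(math.floor(lat / cell), math.floor(lon / cell))] += 1
    return counts.most_common(top)


def symptomatic_share(log):
    """Fraction of status-bearing encounters whose peer reported symptoms."""
    reported = [e for e in log if e.peer_status is not None]
    if not reported:
        return None
    return sum(e.peer_status != "healthy" for e in reported) / len(reported)


log = [
    Encounter("a", "b", 0.0, "cough", (51.505, -0.125)),
    Encounter("a", "c", 60.0, "healthy", (51.505, -0.125)),
    Encounter("b", "d", 90.0, None, (48.855, 2.345)),
    Encounter("b", "e", 95.0),
]
busiest = hotspots(log, top=1)
share = symptomatic_share(log)
```

Leaving peer_status empty gives exactly the Case 2 log, so
the same record format can serve both cases.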



3. CHALLENGES IN COLLECTING DATA 

3.1 High experimental cost 

In general it is expensive to conduct large-scale mo- 
bility experiments. Costs include equipment, software, 
human resources, and generating incentives for peo- 
ple to participate. For example, for the iMote experi- 
ments carried out by the Haggle Project fTSl , the cost 
of iMotes, packages, batteries, participation incentives, 
and the human resources spent on assembling and
distributing devices and monitoring the experiments, added
up to $12,000 (including development) for a small-scale
experiment with just 50 participants, or about $240 per
participant. This is clearly not scalable to experiments
involving billions of people.

3.2 Privacy and government regulations 

The law in many jurisdictions strictly regulates
privacy and thus data collection, making large-scale data
collection even more challenging [14]. Before data
collection can begin, the consent of participants is
required, substantially increasing the administrative
burden. Further, telephone operators are restricted in what
customer data they can store, for how long, and for 
what purpose, and the dissemination of such data is 
even more tightly controlled. This dramatically increases 
the difficulty of obtaining data from operators, which 
otherwise is a good way to reduce collection cost and 
increase dataset size. 

3.3 Lack of motivating applications 

We can see from the discussion above that it is not 
scalable to give out hardware for large-scale experi- 
ments. Instead we must rely on useful or interesting 
software applications to motivate participation of users 
that already own their own hardware. For example, 
there are many applications developed for iPhones but 
no key application exists that enables large-scale data 
collection. An application able to scale up to millions 
of users while collecting data would be incredibly valu- 
able to the research community (as well as economi- 
cally!). Equal value might be obtained through many 
applications with smaller (but still large) user commu- 
nities: it is not a strict requirement that such a large 
dataset consist of a single community, and indeed, it 
might be valuable in avoiding bias if the overall billion- 
sized dataset were composed of numerous smaller (multi- 
million sized) components. 

3.4 Lack of business models 

To motivate a large amount of participation, we may 
need good business models. Such business models can 
motivate operators to share their data, and users to par- 
ticipate in experiments. If all parties — the operators, 
the users and the researchers — can benefit from par- 
ticipating in a system, it is more likely to succeed. 
3.5 Lack of organisation

CAIDA (caida.org) exists to aid Internet traffic data
collection, but there is no such organisation or group
for data collection in mobile or wireless networks. The
closest is CRAWDAD (crawdad.org), but that was
established only to archive wireless data and, though it
has performed this role well, it does not currently coor- 
dinate or lead data collection. An organisation for ini- 
tiating, motivating, and coordinating mobile data col- 
lection would be extremely valuable. If such an organ- 
isation cannot be founded then, given the distributed 
and large-scale nature of the problem, crowd-sourcing 
might be utilised to achieve the same goal. 

4. WHAT CAN WE DO? 

It is impractical to provide experimental devices to 
billions of participants. Our strategy is to develop novel 
application software allowing us to utilize crowd-sourcing.
It is also impossible to collect data from billions of
people while relying on one group alone: we need
collaborative support from the joint force of the research and
industrial communities to achieve active participation 
of sufficient individual users. The key problem is to 
motivate participation of the community and users by 
providing mutual benefit. 

4.1 New communication and networking 
applications 

Novel and useful communication and networking
applications can be one efficient way to motivate
participation. For example, the company SenseNetworks
(www.sensenetworks.com) provides an innovative mobile
application for real-time nightlife discovery and social
navigation, answering the question, "Where is
everybody going right now?" They found that this
application attracted around 100,000 users in North America.
Unfortunately, as with other companies, the data are
not available to the public, but it seems that
developing useful applications might be a viable way to collect
large-scale datasets for research purposes.

Other examples from the research community are
applications designed to encourage users to share their 
mobile phone devices ET\ or calling minutes and text 
messages [16]. Such applications provide an incentive
for usage and so could be used to motivate participation 
in experiments. 

4.2 A common research platform for
mobility and social network study

Currently there are several research groups involved 
in human mobility measurements [6 26 , J_9j2Z*i4J, and 
we expect more researchers will move into the area in 
the near future. Social network research has also re- 
cently become a popular research area, and is often 
integrated with mobility research. In order to
motivate researchers to create a crowd-sourcing effect,
we propose the development of an open platform for
social network and mobility experiments. Researchers
can create their own online social networks for their
research projects by defining the fields of users' profiles
according to the needs of their experiment, e.g., name,
email address, and Bluetooth ID. Separate projects
can have different users, but the platform itself will
merge the databases from all projects. When a new
project starts, the central server informs all users about
the project and invites them to participate. The user
interface and format for each project are similar, and
projects can be merged on the platform. The difference
is that each project has its own database and manages its
data independently. This will save a lot of effort and
administrative hassle when collecting and interpreting
data, and conducting experiments.

4.3 A social proximity application 

Isolation is usually a problem in metropolitan cities. 
Mobile devices can help to detect the devices in prox- 
imity and help people to notice the "familiar strangers" 
around them. 

Mobile phones can sense the people we meet every
day within radio range and also detect the duration
of the proximity. Here we suggest a platform
including both software running on the mobile client and a
web-based application, allowing users to build up
a social network based on the proximity information
detected. Mobile users can create a profile page on
the web server by registering their Bluetooth ID. The
profile page can be similar to a Facebook page, but with
additional features that allow the user to view
statistics about the people he met in any period, and
that propose related strategies for subsequent encounters.
The user can request the addition of the owner of a
particular Bluetooth ID to his friend list, as on Facebook.
We believe this opens a completely new way of socialising.
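
One way the platform might identify "familiar strangers"
from the proximity information is sketched below: a device
qualifies if it has been seen on several distinct days and
its owner is not already a friend. The threshold of three
days, the sighting format, and the function name are
illustrative assumptions rather than part of any existing
system.

```python
from collections import defaultdict


def familiar_strangers(sightings, friends, min_days=3, day_seconds=86400):
    """sightings: iterable of (bluetooth_id, timestamp_in_seconds) pairs.

    Returns the sorted IDs seen on at least min_days distinct days
    that are not already in the user's friend list.
    """
    days_seen = defaultdict(set)
    for device_id, timestamp in sightings:
        days_seen[device_id].add(int(timestamp // day_seconds))
    return sorted(
        device_id
        for device_id, days in days_seen.items()
        if len(days) >= min_days and device_id not in friends
    )


day = 86400
sightings = [("x", 0), ("x", day + 5), ("x", 2 * day + 9),
             ("y", 10), ("y", 20),
             ("z", 0), ("z", day), ("z", 3 * day)]
candidates = familiar_strangers(sightings, friends={"z"})
```

Counting distinct days rather than raw sightings keeps a
single long encounter from being mistaken for familiarity.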

For example a user could use his mobile phone to 
detect the Bluetooth ID of someone whom he sees on 
the subway every day, but to whom he is too scared to
talk. This could enable him to initiate contact, while 
leaving the other party in control of any communica- 
tion. This application scenario may seem socially un- 
likely in the Western world but it is a common pat- 
tern in Asia. But note that a single Asian population, 
however large, is also unrepresentative: many suitable 
applications, encouraging participation from different 
continents, countries and cultures, may be necessary. 

4.4 Request data from the operators 

We have two ways to request data from the
operators: either access to anonymised data, e.g., via
collaborative research projects; or full access to data as a
commercial partner, e.g., by providing commercial value to
the operator through data analysis. An example of the
former is access to the Google metropolitan Wi-Fi
dataset yj. This might be possible if the data can help 
to improve their services or provide them better rev- 
enues, e.g., if understanding human mobility can help 
in Wi-Fi hotspot deployment and placement. For the 
latter approach, one good example is applications like 
Qiro (qiro.net) or SenseNetworks, both of which use 
collaboration with operators to access location infor- 
mation to provide additional services to the users. Qiro 
uses information from T-Mobile, E-Plus, Vodafone and
O2 to help users to locate nearby friends and facilities
such as bicycle rental.

4.5 Collaboration with local government 
and media 

Local governments are powerful entities for
assisting with data collection. They can help to push
applications into reality. Some governments seek to develop
infrastructure and facilities to improve people's lives
in the metropolitan area. By collaborating with these
governments, we can quickly access resources and
deploy facilities. The local media can also be a
good way to gather mobility information, as they are
often interested in new technologies and want to use
them in future campaigns. For example, to
market the movie Artificial Intelligence, an alternate
reality game based on the movie, called The Beast, was
created. The game was conceived as an elaborate murder
mystery played out across hundreds of websites,
email messages, faxes, fake advertisements, and
voicemail messages, and involved over three million active
participants. Collaborating in such activities could give
us datasets covering millions of people. In the swine flu
case, the UK government could also be a good collaborator
for data collection.

4.6 New sources of data 

The popularity of Web 2.0 and user-generated con- 
tent means that there may well be more available hu- 
man mobility datasets on the Internet, if we know where 
to look. For example, Piorkowski was able to extract
125,000 short-term mobility traces from a publicly
available web-based repository of GPS tracks [24],
the Nokia Sports Tracker service, which covers
mobility in many urban areas. Another example is
photo-sharing sites like Flickr. Photo-sharing sites on the
Internet contain billions of publicly accessible images taken
virtually everywhere on earth, which are annotated with
various forms of information including geolocation, time,
photographer, and a wide variety of textual tags.
Researchers have been able to analyse a global collection
of geo-referenced photographs, evaluating their methods on
nearly 35 million images from Flickr [9].

We believe that, in order to achieve the goal of
planet-scale mobility measurement, we need to be more
creative in collecting and merging information from
different sources and sensing methods, and in collaborating
with different organisations.

5. CONCLUSION 

In this paper we challenge the networking
community to collect planet-scale human mobility traces. We
explain why large-scale mobility datasets are
important for networking research, and how they could
impact fundamental research in many other academic
disciplines. We identify the challenges and
difficulties, and propose potential methods to achieve
this goal.

We in no way claim that we have the ideal strategies 
for collecting and managing such datasets: we would 
go so far as to say that this is an impossible mission for 
a single research group. Our intent with this paper is 
to draw the attention of the community to this problem, 
enabling the collective intelligence of the whole com- 
munity to be brought to bear on these crucial problems. 

With this kind of dataset, we believe that we will
completely change the understanding of human
dynamics, potentially opening many new fields of academic
study, just as the availability of Internet and WWW
data allowed the study of complex networks and
systems to flourish, further impacting the understanding of
biological structures, e.g., DNA and proteins. We urge
the community to address these challenges to make this
possible, and in doing so perhaps we can help to save
the world from epidemics like SARS and swine flu.